Estimating Reliability From a Single Administration of a Mastery Test



A mastery test is one in which the range of possible scores is partitioned
into $k$ nonoverlapping intervals that define various levels of student mastery.
The familiar pass-fail test with a criterion of 75 percent correct is an
example of such a criterion-referenced (CR) test, where $k = 2$. Since mastery
tests are often used in conjunction with instructional programs that maximize
the number of students attaining the highest mastery states and minimize the
variability of test scores, the classical correlation between scores on parallel
tests (or equivalently, the ratio of true to observed variance) may be attenuated
by lack of variability and thus is unsatisfactory as an indicator of CR
reliability (Popham and Husek, 1969).

For this reason, Livingston (1972a, 1972b, 1972c, 1973) proposed the
following index of CR reliability for the special case of $k = 2$ mastery states:

$$K^2(X,T) = \frac{\sigma^2(T) + (\mu - C)^2}{\sigma^2(X) + (\mu - C)^2} \tag{1}$$

where $X$ and $T$ are observed and true scores respectively, $\mu$ is the mean
score and $C$ is the criterion score. In words, Equation 1 is the ratio of true
variance plus $(\mu - C)^2$ to observed variance plus $(\mu - C)^2$. Thus,
possible lack of score variability is compensated for by the addition of the
squared distance between the mean and the criterion score. $K^2(X,T)$ increases
as $(\mu - C)^2$ increases for fixed $\sigma^2(T)$ and $\sigma^2(X)$, which for
certain distributions is indicative of the fact that assignment to mastery
states is more stable because scores do not cluster about $C$. However, for a
symmetric, bimodal distribution, $K^2(X,T)$ also increases as $C$ moves away
from $\mu$ toward either of the two modes, even though assignment to mastery
states is more stable at $\mu$ than at the modes. Livingston's index has
subsequently elicited criticism for a number of different reasons (Hambleton &
Novick, 1973; Harris, 1972, 1973; Shavelson, Block & Ravitch, 1972; Raju,
Note 1).
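
The behavior of Equation 1 can be checked numerically. The following is a
minimal Python sketch, not part of the original paper; the function name
`livingston_k2` and the variance, mean and criterion values are hypothetical
and chosen only for illustration.

```python
def livingston_k2(true_var, obs_var, mean, criterion):
    """Livingston's K^2(X,T) as written in Equation 1.

    true_var  -- sigma^2(T), true-score variance
    obs_var   -- sigma^2(X), observed-score variance
    mean      -- mu, the mean score
    criterion -- C, the criterion score
    """
    if obs_var <= 0 or true_var < 0 or true_var > obs_var:
        raise ValueError("need 0 <= true_var <= obs_var and obs_var > 0")
    d2 = (mean - criterion) ** 2          # squared distance (mu - C)^2
    return (true_var + d2) / (obs_var + d2)


# Hypothetical values: the index grows as C moves away from the mean,
# whatever the shape of the score distribution.
if __name__ == "__main__":
    for c in (17.0, 15.0, 20.0):
        print(c, round(livingston_k2(2.0, 5.0, 17.0, c), 3))
```

With $C = \mu$ the index reduces to the classical ratio of true to observed
variance; any other criterion can only raise it, which is the property the
bimodal counterexample above exploits.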

Harris (Note 2) thus proposed another coefficient for the case $k = 2$:
the squared correlation between mastery state, scored 0 and 1, and total score.
In analysis of variance terms, this is a strength of relationship index given
by:

$$\mu_c^2 = \frac{SS_b}{SS_b + SS_w} \tag{2}$$

where $SS_b$ and $SS_w$ are between and within sums of squares from a one-way
analysis of test score variance for the $k = 2$ groups defined by criterion $C$.
However, Harris notes, for symmetric distributions, the maximum value of
$\mu_c^2$ occurs at $C = \mu$ when the proportions in the two groups equal
one-half. In the case of a symmetric, unimodal distribution this implies that
$\mu_c^2$ is largest when $C = \mu$ is at the point of greatest score density
and, thus, when assignment to mastery states is relatively unstable.

More recently, Hambleton and Novick (1973) have suggested that an index
of CR reliability reflect the degree to which students are consistently
assigned to the same mastery states across parallel test administrations, as
measured by some coefficient of agreement across testings. Accordingly,
Swaminathan, Hambleton and Algina (1974) proposed that the proportion of
students consistently assigned to mastery states across two testings serve as
an estimate, i.e.,

$$p_o = \sum_{i=1}^{k} p_{ii} \tag{3}$$

where $p_{ii}$ is the proportion of students consistently assigned to the $i$th
mastery state across the two administrations. Actually, Swaminathan, Hambleton
and Algina recommend that a simple function of $p_o$ be used, namely, the
proportion of consistent assignments beyond that expected by chance.
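
The two indices just described can be sketched in a few lines of Python. This
is an illustrative sketch rather than code from the paper: the function names
and the score lists are hypothetical, mastery is taken to mean a score at or
above the criterion, and the chance-corrected value is computed as Cohen's
kappa from the two marginal mastery proportions.

```python
def harris_mu_c2(scores, criterion):
    """SS_b / (SS_b + SS_w) for the two groups split at the criterion."""
    groups = [[x for x in scores if x >= criterion],
              [x for x in scores if x < criterion]]
    grand_mean = sum(scores) / len(scores)
    ss_b = 0.0
    ss_w = 0.0
    for g in groups:
        if not g:                       # an empty group contributes nothing
            continue
        g_mean = sum(g) / len(g)
        ss_b += len(g) * (g_mean - grand_mean) ** 2
        ss_w += sum((x - g_mean) ** 2 for x in g)
    total = ss_b + ss_w
    return ss_b / total if total > 0 else float("nan")


def two_administration_agreement(x, x_prime, criterion):
    """Return (p_o, kappa) for k = 2 states from two score lists."""
    n = len(x)
    pairs = list(zip(x, x_prime))
    both_master = sum(a >= criterion and b >= criterion for a, b in pairs) / n
    both_non = sum(a < criterion and b < criterion for a, b in pairs) / n
    p_o = both_master + both_non
    m1 = sum(a >= criterion for a in x) / n          # mastery rate, test X
    m2 = sum(b >= criterion for b in x_prime) / n    # mastery rate, test X'
    p_chance = m1 * m2 + (1 - m1) * (1 - m2)
    kappa = (p_o - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")
    return p_o, kappa


# Hypothetical scores for ten students on two five-item tests, C = 4.
x_scores = [0, 4, 2, 5, 4, 1, 3, 5, 2, 4]
x_prime_scores = [1, 3, 2, 5, 4, 0, 2, 4, 4, 5]
p_o, kappa = two_administration_agreement(x_scores, x_prime_scores, 4)
```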

Most recently, Marshall and Haertel (Note 3) have suggested a single
test administration coefficient of agreement estimate. Their method is one
of computing the average $p_o$ across all possible split-halves of a single test
(denoted $\beta$ because it is computationally analogous to the classical
$\alpha$ coefficient) and then stepping up $\beta$ by a Spearman-Brown type
formula to obtain an estimate for the full-length test. Initial results based
on simulated data seem to indicate that the Marshall-Haertel index behaves in a
reasonable manner for different score distributions and criteria $C$ (Marshall,
Note 4), i.e., the coefficient increases and decreases appropriately as
criterion $C$ is variously set at points of light and heavy score concentration.
However, the derivation of the index was basically empirical rather than
theoretical, and thus many of its statistical properties are presently unknown.
The purpose of the present paper is to propose an alternative,
single-administration coefficient of agreement estimate that is based on
well-known statistical theory.

The Mathematical Model 
Let us begin by formally defining the coefficient of agreement for an
individual $i$ as the probability that $i$ is assigned to the same mastery state
on parallel tests $X$ and $X'$. The model for the case of $k = 2$ mastery states
defined by criterion score $C$ is outlined here; but the model extends easily to
$k > 2$ mastery states defined by multiple criteria $C_1, C_2, \ldots, C_{k-1}$.
There are two ways that an individual $i$ can be assigned to the same mastery
state on parallel tests $X$ and $X'$ with criterion $C$: (1) $X_i \ge C$ and
$X'_i \ge C$, indicating consistent mastery/mastery decisions, and (2) $X_i < C$
and $X'_i < C$, indicating consistent nonmastery/nonmastery decisions. (There
are also two ways that inconsistent decisions can arise: (1) $X_i \ge C$ and
$X'_i < C$, and (2) $X_i < C$ and $X'_i \ge C$.) Thus the coefficient of
agreement for person $i$ can be written:

$$P_C^{(i)} = P(X_i \ge C,\ X'_i \ge C) + P(X_i < C,\ X'_i < C) \tag{4}$$

where the terms on the right side of Equation 4 are the probability of
consistent mastery/mastery and nonmastery/nonmastery decisions respectively.
Equation 4 might be of interest to educators who want to determine the
reliability of a mastery test for making decisions about a particular person in
an individualized instructional program.



The coefficient of agreement for a group of $N$ persons can then be defined
as the mean of the individual coefficients:

$$P_C = \frac{1}{N}\sum_{i=1}^{N}\left[P(X_i \ge C,\ X'_i \ge C) + P(X_i < C,\ X'_i < C)\right] \tag{5}$$

Equation 5 is the sum of the probability of a consistent decision for each
person $i$ weighted by his or her probability of occurrence in the group, and so
again represents the (group) probability of a consistent decision on parallel
tests.

Let us now introduce two assumptions that make possible the estimation
of the individual coefficient of Equation 4, and thus also the estimation of
the group coefficient in Equation 5. The first assumption is that scores $X_i$
and $X'_i$ are independently distributed for a fixed person $i$ (Lord and
Novick, 1968). Under this assumption Equation 4 can be rewritten:

$$P_C^{(i)} = P(X_i \ge C) \cdot P(X'_i \ge C) + P(X_i < C) \cdot P(X'_i < C) \tag{6}$$

This assumption implies that the experience of taking test $X$ does not affect
the outcome on test $X'$ for person $i$ or vice versa; and its validity would
depend upon the degree to which content and administration of the two tests are
separate.

The second assumption is that the distributions of $X_i$ and $X'_i$ for a
fixed person are identically binomial in form (Lord & Novick, 1968). This
implies that each of the $n$ items on a test is scored 0 and 1 and also that the
experience of taking earlier test items does not affect outcomes on later items.
Under this assumption Equation 6 simplifies to

$$P_C^{(i)} = [P(X_i \ge C)]^2 + [P(X_i < C)]^2 = [P(X_i \ge C)]^2 + [1 - P(X_i \ge C)]^2 \tag{7}$$

where

$$P(X_i \ge C) = \sum_{X_i = C}^{n} \binom{n}{X_i} p_i^{X_i} (1 - p_i)^{n - X_i} \tag{8}$$

The quantity $p_i$ in Equation 8 is the true probability of a correct item
response for person $i$, which can be estimated from his or her observed score
on a single test, e.g., $\hat{p}_i = X_i/n$. As illustrated in a later section,
the probability of consistent classification for each person can be estimated
by Equations 7 and 8 and for an entire group by Equation 5, using the data from
a single test administration.
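
Equations 7 and 8 translate directly into code. The sketch below is
illustrative only; the function names are hypothetical, and the values
$n = 5$, $C = 4$ and $\hat{p}_i = .8$ are chosen for the example rather than
taken from Table 1.

```python
from math import comb


def prob_at_least(c, n, p):
    """Equation 8: P(X >= c) for a binomial score on n items."""
    return sum(comb(n, x) * p**x * (1.0 - p)**(n - x) for x in range(c, n + 1))


def individual_agreement(c, n, p):
    """Equation 7: probability of the same decision on two parallel tests."""
    q = prob_at_least(c, n, p)     # probability of a mastery decision
    return q**2 + (1.0 - q)**2     # mastery/mastery + nonmastery/nonmastery


# Hypothetical person with estimated p = .8 on a five-item test, C = 4.
if __name__ == "__main__":
    print(round(prob_at_least(4, 5, 0.8), 5))
    print(round(individual_agreement(4, 5, 0.8), 4))
```

The coefficient is bounded below by .50, which it reaches when a person is
equally likely to fall on either side of the criterion.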

Furthermore, the marginal group probability of assignment to the mastery
(nonmastery) state is the same for both $X$ and $X'$ under the assumption of
identically distributed $X_i$ and $X'_i$; and the group probability of a
consistent decision due to chance with criterion $C$ is:

$$P_{chance/C} = P(X \ge C) \cdot P(X' \ge C) + P(X < C) \cdot P(X' < C) = [P(X \ge C)]^2 + [1 - P(X \ge C)]^2 \tag{9}$$

where

$$P(X \ge C) = \frac{1}{N}\sum_{i=1}^{N} P(X_i \ge C) \tag{10}$$

Thus the group probability of a consistent decision beyond that expected by
chance is given by the kappa coefficient (Cohen, 1960, 1968, 1972; Swaminathan
et al., 1974):

$$\kappa_C = \frac{P_C - P_{chance/C}}{1 - P_{chance/C}} \tag{11}$$

where $P_C$ is given by Equation 5 and $P_{chance/C}$ is given by Equations 9
and 10.
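
The group quantities in Equations 5 and 9-11 can be computed from a list of
per-person estimates of $p_i$. The following sketch assumes the simple binomial
model; the function names and the example inputs are hypothetical.

```python
from math import comb


def prob_at_least(c, n, p):
    """Equation 8: P(X >= c) for a binomial score on n items."""
    return sum(comb(n, x) * p**x * (1.0 - p)**(n - x) for x in range(c, n + 1))


def group_indices(p_hats, c, n):
    """Return (P_C, P_chance, kappa) for a group, Equations 5 and 9-11."""
    q = [prob_at_least(c, n, p) for p in p_hats]
    # Equation 5 with each term given by Equation 7.
    p_c = sum(qi**2 + (1.0 - qi)**2 for qi in q) / len(q)
    q_bar = sum(q) / len(q)                       # Equation 10
    p_chance = q_bar**2 + (1.0 - q_bar)**2        # Equation 9
    if p_chance >= 1.0:                           # kappa undefined here
        return p_c, p_chance, float("nan")
    kappa = (p_c - p_chance) / (1.0 - p_chance)   # Equation 11
    return p_c, p_chance, kappa
```

When every person has the same $\hat{p}_i$, $P_C$ equals $P_{chance/C}$ and
kappa is zero: under this model, agreement beyond chance arises only from
differences among persons.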

At this point, it may be interesting to reflect on a more general
mathematical model of which Equations 6, 7 and 8 constitute a special case.
Figure 1 represents the outcomes over repeated, joint administrations of
parallel tests $X$ and $X'$ to person $i$ with criterion $C$. The hatched areas
of Quadrants I and III represent consistent decisions. The essential problem is
one of determining the proportion of the bivariate distribution that falls in
these two quadrants, given data from a single test administration. The binomial
model is a logical first choice because it is relatively simple and yet
flexible enough to account for the change in different students' distributions
of scores, as their true abilities vary from near the "floor" of a test through
the midrange and to the "ceiling" (see Lord and Novick, 1968, p. 510).

Insert Figure 1 about here

However, more complex models probably provide a more accurate description of
reality in most testing situations. For example, Equation 8 might be replaced
by a compound binomial model (Lord & Novick, 1968, pp. 524-526):

$$P(X_i \ge C) = \sum_{X_i = C}^{n}\left\{\binom{n}{X_i} p_i^{X_i}(1 - p_i)^{n - X_i} + A_i\,B(X_i)\right\} \tag{12}$$

$A_i$ and $B(X_i)$ above are defined by:

$$A_i = \frac{n(n-1)\,S_p^2\,p_i(1 - p_i)}{2\left[M_X(n - M_X) - S_X^2 - n S_p^2\right]} \tag{13}$$

$$B(X_i) = \sum_{v=0}^{2}(-1)^{v+1}\binom{2}{v}\binom{n-2}{X_i - v}\,p_i^{X_i - v}(1 - p_i)^{(n-2)-(X_i - v)} \tag{14}$$

In Equation 13, $S_p^2$ is the variance of the $n$ item difficulties; and $M_X$
and $S_X^2$ are respectively the mean and variance of test scores for the group.
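
A rough Python sketch of the compound binomial computation follows. It takes
Equations 12-14 in the form $A_i = n(n-1)S_p^2 p_i(1-p_i) / \{2[M_X(n-M_X) -
S_X^2 - nS_p^2]\}$ and $B(X_i)$ as a signed sum of three binomial terms on
$n - 2$ items; that reading of the equations is an assumption and should be
checked against Lord and Novick (1968, pp. 524-526). The function names and all
numerical inputs are hypothetical.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials; zero outside 0..n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1.0 - p)**(n - x)


def a_coefficient(p, n, mean, var, item_var):
    """Equation 13 (as read here): weight of the correction term."""
    denom = 2.0 * (mean * (n - mean) - var - n * item_var)
    return n * (n - 1) * item_var * p * (1.0 - p) / denom


def b_term(x, n, p):
    """Equation 14 (as read here): signed sum over v = 0, 1, 2."""
    return sum((-1) ** (v + 1) * comb(2, v) * binom_pmf(x - v, n - 2, p)
               for v in range(3))


def compound_prob_at_least(c, n, p, mean, var, item_var):
    """Equation 12: P(X >= c) under the two-term compound binomial."""
    a = a_coefficient(p, n, mean, var, item_var)
    return sum(binom_pmf(x, n, p) + a * b_term(x, n, p)
               for x in range(c, n + 1))
```

Two sanity checks follow from this form: the correction terms sum to zero over
all scores, so the probabilities still total one, and setting the item
difficulty variance to zero recovers the simple binomial model of Equation 8.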



Estimating $p_i$



The computational process of the previous section is set in motion by
estimating the probability of a correct item response $p_i$ for each person from
the observed data. $P(X_i \ge C)$ can then be computed by Equation 8 for the
simple binomial model or Equation 12 for the compound binomial model, followed
by Equations 7 and 5 or by Equations 9-11. The present section considers
various ways of estimating $p_i$.

Simple Binomial Model

The traditional (maximum likelihood) estimator of $p_i$ is the proportion of
test items correctly answered by person $i$:

$$\hat{p}_i = X_i / n \tag{15}$$

where $X_i$ is the number correct and $n$ is the total number of items. Since
the standard error of estimate in this case is $\sqrt{p_i(1 - p_i)/n}$,
Equation 15 should lead to reasonably accurate results if $n > 40$,
particularly if the mastery level of most students is well above (below)
$p_i = .50$.

However, Equation 15 does not include certain collateral information,
such as mean $M_X$ and variance $S_X^2$, that is available in group testing
situations. When the number of items $n$ is small, the inclusion of such
information is particularly important for obtaining better estimates of $p_i$
than those given by Equation 15. For example, if the distribution of observed
scores $X_i$ for the group approximates some member of the negative
hypergeometric family of unimodal distributions (see Lord & Novick, 1968,
p. 519 for illustrations) a better estimate of $p_i$ is given by the regression
equation:

$$\hat{p}_i = \alpha_{21}\left(\frac{X_i}{n}\right) + (1 - \alpha_{21})\left(\frac{M_X}{n}\right) \tag{16}$$

where $\alpha_{21} = \frac{n}{n-1}\left[1 - \frac{M_X(n - M_X)}{n S_X^2}\right]$
is the Kuder-Richardson Formula 21 reliability coefficient (which is the
squared correlation between observed and true score under the simple binomial
model).
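
Equation 16 and the KR-21 coefficient are simple to compute. In the sketch
below the function names are hypothetical; the inputs $n = 25$,
$M_X = 17.40$ and $S_X^2 = 5.14$ are the summary statistics reported for the
real-data example in the Examples section.

```python
def kr21(n, mean, var):
    """Kuder-Richardson Formula 21 from test length, score mean and variance."""
    return (n / (n - 1)) * (1.0 - mean * (n - mean) / (n * var))


def regressed_p(x, n, mean, alpha):
    """Equation 16: shrink the observed proportion toward the group mean."""
    return alpha * (x / n) + (1.0 - alpha) * (mean / n)


if __name__ == "__main__":
    alpha = kr21(25, 17.40, 5.14)
    # With so little score variance the coefficient is close to zero
    # (slightly negative), so every estimate is pulled to M_X / n.
    print(round(alpha, 3))
    print(round(regressed_p(20, 25, 17.40, 0.0), 3))
```

When $\alpha_{21}$ is near one the estimate stays close to $X_i/n$; when it is
near zero, as here, every person receives nearly the group mean proportion.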

Equation 16 assumes that person $i$ is a member of a unimodal distribution
with mean $M_X$ and variance $S_X^2$. However, multimodal situations are
possible if different grade levels are present or if the test items are
designed to discriminate very sharply between masters and nonmasters. Blind use
of Equation 16 in such situations can lead to erroneous $\hat{p}_i$ estimates
because the means and variances of the separate populations may be very
different from the mean and variance for the combined data. If the various
populations are clearly distinguishable, a separate regression equation like
(16) can be derived for each group. However, an estimation procedure for $p_i$
that employs collateral information and yet is free of distributional
assumptions has obvious advantages. One such estimate is given by (Lord and
Novick, 1968, p. 514):

[Equation 17]

where $\phi(X-1)$ and $\phi(X)$ are the relative frequency of $X-1$ and $X$ in
the combined group and $\hat{p}_{X-1}$ and $\hat{p}_X$ are the proportion
estimates corresponding to scores of $X-1$ and $X$. Unfortunately, complexity
is the price that one pays for the generality of Equation 17. Accurate
estimation of $\phi(X-1)$ and $\phi(X)$ requires a large sample of subjects.
Additionally, since (17) represents $n$ equations in $n + 1$ unknowns, the
researcher must specify one of the $\hat{p}_{X-1}$ values to set the estimation
process in motion, e.g., if $X-1$ is a chance score on an $m$-option multiple
choice test one might set $\hat{p}_{X-1} = 1/m$. See Lord (1959) for examples
of the use of Equation 17. Further pursuit of simple, yet general, procedures
for estimating $p_i$ with small $n$ is clearly indicated.



Compound Binomial Model

The procedures here are analogous to those above. If $n$ is large the
classical estimate of Equation 15 can be used.

However, the following regression estimate includes collateral information
about the mean, variance and item difficulties for a unimodal distribution:

$$\hat{p}_i = \alpha_{20}\left(\frac{X_i}{n}\right) + (1 - \alpha_{20})\left(\frac{M_X}{n}\right) \tag{18}$$

where $\alpha_{20}$ is the Kuder-Richardson Formula 20 reliability coefficient
(which is the squared correlation between observed and true scores under the
compound binomial model).
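
Unlike KR-21, the KR-20 coefficient needs item-level data. The sketch below
uses the standard KR-20 formula, $\frac{n}{n-1}\left[1 - \sum_g p_g(1-p_g) /
S_X^2\right]$, with the score variance computed with divisor $N$; that choice of
divisor, the function names, and the small response matrix are assumptions made
for illustration.

```python
def kr20(item_matrix):
    """KR-20 from a persons-by-items matrix of 0/1 item scores."""
    n_persons = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean = sum(totals) / n_persons
    # Score variance with divisor N (an assumption; see lead-in).
    var = sum((t - mean) ** 2 for t in totals) / n_persons
    # Item difficulties p_g: proportion of persons answering item g correctly.
    item_p = [sum(row[g] for row in item_matrix) / n_persons
              for g in range(n_items)]
    sum_pq = sum(p * (1.0 - p) for p in item_p)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var)


def regressed_p_kr20(x, n, mean, alpha20):
    """Equation 18: regression estimate of p_i using KR-20."""
    return alpha20 * (x / n) + (1.0 - alpha20) * (mean / n)


# Hypothetical responses of four persons to three items.
responses = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
alpha20 = kr20(responses)
```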

Examples 

In order to illustrate the computation of the individual and group
coefficients $P_C^{(i)}$ and $P_C$, the simple binomial model will be applied
first to a small set of simulated data and then to real data.

As shown in Table 1, the true probability of a correct item response $p_i$
was specified for each of $N = 10$ hypothetical subjects. An observed score
$X_i$ on an $n = 5$ item test was generated for each subject using $p_i$. For
example, a random unit (a digit from 0 to 9) was drawn indicating Person 1's
performance on each of the five items as follows: 9, 3, 6, 5, 2. Since Person
1's probability of a correct response is $p_1 = .2$, Units 0-1 were scored as
correct and Units 2-9 as incorrect; accordingly $X_1 = 0$ as shown in Table 1.

Insert Table 1 about here
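
The scoring rule used to generate the simulated data can be sketched as
follows. The function names are hypothetical, and the rule as coded assumes
that $p_i$ is a multiple of .1 so that it maps onto a whole number of digits.

```python
import random


def score_from_digits(digits, p):
    """Count digits scored as correct: digit d is correct when d < 10 * p.

    Assumes p is a multiple of .1, e.g. p = .2 makes digits 0-1 correct.
    """
    cutoff = round(10 * p)
    return sum(1 for d in digits if d < cutoff)


def simulate_score(n_items, p, rng):
    """Draw one random digit per item and return (score, digits)."""
    digits = [rng.randrange(10) for _ in range(n_items)]
    return score_from_digits(digits, p), digits


if __name__ == "__main__":
    # Person 1 in the text: digits 9, 3, 6, 5, 2 with p = .2.
    print(score_from_digits([9, 3, 6, 5, 2], 0.2))
    # A fresh simulated score for a hypothetical person with p = .7.
    print(simulate_score(5, 0.7, random.Random(1)))
```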

These single-administration $X_i$ scores are used to estimate $P_C^{(i)}$
and $P_C$ where $C = 4$. First the probability of a correct response $p_i$ for
each student is estimated by Equation 16. Next, $\hat{p}_i$ is substituted into
Equation 8 with $C = 4$ and $n = 5$ to obtain $P(X_i \ge 4)$ and its complement
$1 - P(X_i \ge 4)$ for each student. $P(X_i \ge 4)$ and $1 - P(X_i \ge 4)$ are
squared and summed according to Equation 7 to provide an estimate
$\hat{P}_C^{(i)}$ of each individual's coefficient of agreement; and finally
the group coefficient of agreement is the mean of the $\hat{P}_C^{(i)}$ column
as indicated by Equation 5, i.e., $\hat{P}_C = 7.5196/10 \approx .75$.

As a check on the reasonableness of the estimate above, a second set of
$X'_i$ scores, shown in the last column of Table 1, was generated in the same
way as the $X_i$ scores. Since eight of the students are consistently classified
as master/master or nonmaster/nonmaster on both tests with $C = 4$ (Students 2
and 8 being the exceptions), the two-administration estimate of the coefficient
of agreement is $p_o = 8/10 = .80$. A comparison of the one- and
two-administration estimates across criteria $C = 1, 2, 3, 4$ for the example of
Table 1 indicates a median difference of 3 percent between the two indices.

However, the proof of the pudding is in the eating; so let us now consider
some real test data. In 1974, Form 4B of the Mathematics Basic Concepts
Subtest (Sequential Tests of Educational Progress, Series II) was administered
in Grade 5 of the Madison Public Schools. This is a 50-item multiple-choice
test of factual recall, mathematical manipulation, and so forth. A group of
$N = 30$ students was selected for analysis, and an odd-item score $X_i$ and an
even-item score $X'_i$ were computed for each student, providing unimodal
distributions of scores on two, roughly parallel tests of $n = 25$ items.
Summary statistics for the unimodal scores $X_i$ and $X'_i$ were as follows:
(a) $M_X = 17.40$ and $M_{X'} = 17.17$, (b) $S_X^2 = 5.14$ and
$S_{X'}^2 = 4.47$. As in Table 1, a single-administration estimate based on the
$X_i$ scores was compared for reasonableness to a corresponding
dual-administration estimate based on both $X_i$ and $X'_i$ scores. The results
are shown in Figure 2.

Since the distribution of $X_i$ in Figure 2 (*) has small variance, as might
be expected on a criterion referenced test, the norm referenced reliability
coefficient $\alpha_{21}$ is seriously attenuated:
$\alpha_{21} = (25/24)[1 - (17.40)(25 - 17.40)/(25 \times 5.14)] \approx 0$.
Thus by Equation 16, $\hat{p}_i = 0(X_i/25) + (1 - 0)(17.40/25) \approx .70$
for each of the 30 students. Using the procedure outlined in Table 1 a value
of $\hat{P}_C$ was computed for $C = 10, 11, \ldots, 25$ as indicated by the
broken line in Figure 2. The two-administration estimate $p_o$ (Swaminathan et
al., 1974) based on both $X_i$ and $X'_i$ scores was also computed for the same
$C$ values, as indicated by the solid line in Figure 2. The median difference
between the two curves is 3 percent across criteria $C = 10, 11, \ldots, 25$.
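
Because every student receives the same estimate $\hat{p}_i = 17.40/25$ in
this example, the single-administration curve depends only on $n$, $\hat{p}_i$
and $C$ and can be recomputed without the raw scores. The sketch below does so
under the simple binomial model; the variable and function names are
hypothetical, and the resulting values are not claimed to match the plotted
points of Figure 2 exactly.

```python
from math import comb


def prob_at_least(c, n, p):
    """Equation 8: P(X >= c) for a binomial score on n items."""
    return sum(comb(n, x) * p**x * (1.0 - p)**(n - x) for x in range(c, n + 1))


n_items = 25
p_hat = 17.40 / 25      # Equation 16 with alpha_21 taken as 0, as in the text

# With identical p_hat for all students, the group mean of Equation 5
# equals the individual coefficient of Equation 7.
curve = {}
for c in range(10, 26):
    q = prob_at_least(c, n_items, p_hat)
    curve[c] = q**2 + (1.0 - q)**2

if __name__ == "__main__":
    for c, value in curve.items():
        print(c, round(value, 3))
```

The estimate is lowest where the criterion falls nearest the group mean score
and rises toward one as the criterion moves into either tail, the pattern
attributed to the broken line in Figure 2.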

Figure 2 illustrates that $\hat{P}_C$ is a reasonable estimate in the sense
that it increases and decreases at points of light and heavy score density (*)
in the same way as $p_o$. However, it would be most unwise to draw conclusions
about the accuracy of $\hat{P}_C$ relative to $p_o$ on the basis of this single
data set. In this particular case, $\hat{P}_C$ generally provides a
conservative estimate of the proportion of consistent decisions relative to
$p_o$. This can be accounted for by two factors: (a) $X_i$ and $X'_i$ are not
based on independent administrations as assumed by the model for $\hat{P}_C$,
so $p_o$ estimates tend to be larger than $\hat{P}_C$ estimates; and (b) the
simple binomial model is an approximation to reality. In regard to the latter
point, theory suggests that the compound binomial model of Equations 12-14
would further enhance the agreement between the curves of Figure 2.

Generalization to k Mastery Levels 
Suppose there are $k$ possible mastery levels defined by $k - 1$ criteria
$C_1, C_2, \ldots, C_{k-1}$. For example, $k = 3$ mastery states like
below-average, average and above-average might be defined by two criterion
scores $C_1, C_2$. Then the probability that person $i$ is consistently
classified is given by a general form of Equation 4:

$$\begin{aligned} P^{(i)}_{C_1 C_2 \cdots C_{k-1}} &= P(X_i < C_1,\ X'_i < C_1) + P(C_1 \le X_i < C_2,\ C_1 \le X'_i < C_2) + \cdots + P(C_{k-1} \le X_i,\ C_{k-1} \le X'_i) \\ &= [P(X_i < C_1)]^2 + [P(C_1 \le X_i < C_2)]^2 + \cdots + [P(C_{k-1} \le X_i)]^2 \end{aligned} \tag{19}$$

where the second line of Equation 19 again follows from the assumption that
$X_i$ and $X'_i$ are independently and identically distributed. If $X_i$ is
again assumed to have a simple or compound binomial distribution, each term in
the bottom line of Equation 19 can be estimated by summing binomial
probabilities as in Equation 8 or by summing compound binomial probabilities as
in Equation 12. For example, if the simple binomial model is assumed,

$$P(C_1 \le X_i < C_2) = \sum_{X_i = C_1}^{C_2 - 1} \binom{n}{X_i} p_i^{X_i} (1 - p_i)^{n - X_i}$$
The group probability of consistent classification is then obtained by
averaging the $P^{(i)}_{C_1 C_2 \cdots C_{k-1}}$ as in Equation 5:

$$P_{C_1 C_2 \cdots C_{k-1}} = \frac{1}{N}\sum_{i=1}^{N} P^{(i)}_{C_1 C_2 \cdots C_{k-1}} \tag{20}$$

Finally, Equation 9 can be written more generally to obtain the group
probability of consistent classification due to chance with criteria
$C_1, C_2, \ldots, C_{k-1}$ as follows:

$$P_{chance/C_1 C_2 \cdots C_{k-1}} = [P(X < C_1)]^2 + [P(C_1 \le X < C_2)]^2 + \cdots + [P(C_{k-1} \le X)]^2 \tag{21}$$

where, for example, $P(C_1 \le X < C_2)$ is obtained as in Equation 10:

$$P(C_1 \le X < C_2) = \frac{1}{N}\sum_{i=1}^{N} P(C_1 \le X_i < C_2) \tag{22}$$

Coefficient kappa $\kappa$ is then obtained as in Equation 11, substituting
$P_{C_1 C_2 \cdots C_{k-1}}$ and $P_{chance/C_1 C_2 \cdots C_{k-1}}$ as defined
above.
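
The generalization to $k$ mastery states can be sketched under the simple
binomial model as follows. The function names and example inputs are
hypothetical; each state is taken to include its lower criterion and exclude
its upper one, matching the mastery rule $X_i \ge C$ used for $k = 2$.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of x correct out of n items."""
    return comb(n, x) * p**x * (1.0 - p)**(n - x)


def state_probs(criteria, n, p):
    """Probabilities of the k states cut at sorted criteria C_1 < ... < C_{k-1}."""
    bounds = [0] + list(criteria) + [n + 1]
    return [sum(binom_pmf(x, n, p) for x in range(lo, hi))
            for lo, hi in zip(bounds, bounds[1:])]


def k_state_indices(p_hats, criteria, n):
    """Return (P, P_chance, kappa) following Equations 19-22 and 11."""
    per_person = [state_probs(criteria, n, p) for p in p_hats]
    n_persons = len(per_person)
    # Equations 19 and 20: mean over persons of the summed squared terms.
    p_c = sum(sum(s * s for s in probs) for probs in per_person) / n_persons
    # Equation 22: marginal probability of each state in the group.
    k = len(criteria) + 1
    marginals = [sum(probs[j] for probs in per_person) / n_persons
                 for j in range(k)]
    p_chance = sum(m * m for m in marginals)          # Equation 21
    if p_chance >= 1.0:                               # kappa undefined here
        return p_c, p_chance, float("nan")
    return p_c, p_chance, (p_c - p_chance) / (1.0 - p_chance)
```

With a single criterion the functions reduce to the two-state results of
Equations 5 and 9-11.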
Table 1

Estimation of $P_C^{(i)}$ and $P_C$ Using Simulated Data for Ten Persons on a
Five-Item Mastery Test With Criterion $C = 4$

$M_X = 2.30$, $S_X^2 = 2.61$, $\alpha_{21/X} = .58$
$M_{X'} = 3.00$, $S_{X'}^2 = 1.40$, $\alpha_{21/X'}$ = [illegible]
Figure Captions

Figure 1. Outcomes Over Repeated Administrations of Parallel Tests to an
Individual

Figure 2. Comparison of One- and Two-Administration Indices For Various
Criterion Points in a Unimodal Distribution